[WIP] Add support for DELTA_BINARY_PACKED and DELTA_BYTE_ARRAY encodings to Parquet reader #12948
Closed
…into feature/validate_encodings
Forward-merge branch-23.06 to branch-23.08
Bump up JNI version to 23.08.0-SNAPSHOT in branch-23.08 Authors: - Peixin (https://github.com/pxLi) Approvers: - Nghia Truong (https://github.com/ttnghia) - Jason Lowe (https://github.com/jlowe) URL: rapidsai#13401
…ply (rapidsai#13429) Closes rapidsai#13426 Authors: - https://github.com/brandon-b-miller Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Bradley Dice (https://github.com/bdice) URL: rapidsai#13429
) Cleans up source files for nvtext and io-text pytests. The pytests are placed into separate files: `test_io_text.py` for the io-text pytests and `test_nvtext.py` for the nvtext pytests. Also removed the `python/cudf/cudf/tests/text` folder which contained 2 empty `.py` files. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: rapidsai#13435
This PR attempts to allow using newer versions of scikit-build again. cf. rapidsai#13188 Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - AJ Schmidt (https://github.com/ajschmidt8) - Lawrence Mitchell (https://github.com/wence-) URL: rapidsai#13424
closes rapidsai#13412 Remove weak references of cleaned resources when a resource is cleaned. The cleaned objects are never leaked, it's safe to remove the weak references. This is to reduce the memory usage. Authors: - Chong Gao (https://github.com/res-life) Approvers: - Jason Lowe (https://github.com/jlowe) - Robert (Bobby) Evans (https://github.com/revans2) - MithunR (https://github.com/mythrocks) URL: rapidsai#13378
Depends on: rapidsai/rapids-cmake#393 Once the above PR is merged, this updated logic ensures that cudf places the custom versions of cccl packages in correct places, and can find them once installed. Authors: - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#13235
Closing for now. Will resubmit as part of #13501.
vyasr added 4 - Needs Review and removed 4 - Needs cuIO Reviewer labels on Feb 23, 2024
Labels
4 - Needs Review
Waiting for reviewer to review or respond
CMake
CMake build issue
cuIO
cuIO issue
feature request
New feature or request
Java
Affects Java cuDF API.
libcudf
Affects libcudf (C++/CUDA) code.
non-breaking
Non-breaking change
Python
Affects Python cuDF API.
Description
Some Parquet writers will fall back to DELTA_BINARY_PACKED or DELTA_BYTE_ARRAY encoding when dictionary encoding cannot be used. This PR is a first attempt at adding support for these encodings to the Parquet reader. A description of these encodings can be found starting here.
I'm mostly looking for feedback on my approach right now, in particular the final decoding of strings in DELTA_BYTE_ARRAY. Each string is encoded as a prefix length from the preceding string, a suffix length, and then the suffix bytes. To reconstruct `string_i`, you need `prefix_length(i)` bytes from `string_(i-1)`, which at first blush seems to be a serial operation. I've used a few cheats to try to be a bit more parallel, but am open to suggestions to make it even more so. The logic for this is in the `StringScan` function starting at line 2105 of page_data.cu.
I'm also wondering if it makes more sense to use all 128 threads to do decoding, rather than the current approach of using one warp for rep/def level decoding, one or two warps for delta decoding, and one warp outputting values (which mirrors how the current decoder works).
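For reference, the serial dependency in DELTA_BYTE_ARRAY reconstruction can be sketched on the host as below. This is a minimal illustration, not the `StringScan` code from this PR: the function name `decode_delta_byte_array` and its parameters are hypothetical, and it assumes the prefix lengths and suffix lengths have already been decoded from their DELTA_BINARY_PACKED blocks and that the suffix bytes are stored back to back. Note that the suffix offsets depend only on the suffix lengths (a prefix sum), so only the prefix copy from `string_(i-1)` carries a dependency between neighbouring strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Serial reference decoder for DELTA_BYTE_ARRAY-style data (illustrative only).
// All names are hypothetical; inputs are assumed to be already-decoded lengths
// plus the concatenated suffix bytes.
std::vector<std::string> decode_delta_byte_array(
  std::vector<std::size_t> const& prefix_lengths,
  std::vector<std::size_t> const& suffix_lengths,
  std::string const& suffix_bytes)
{
  assert(prefix_lengths.size() == suffix_lengths.size());
  std::size_t const n = prefix_lengths.size();

  // Suffix offsets are an exclusive prefix sum of the suffix lengths.
  // This step does not depend on any decoded string.
  std::vector<std::size_t> suffix_offsets(n);
  std::size_t running = 0;
  for (std::size_t i = 0; i < n; ++i) {
    suffix_offsets[i] = running;
    running += suffix_lengths[i];
  }

  // string_i = first prefix_length(i) bytes of string_(i-1) + suffix_i.
  // The read of out[i - 1] is the step that looks inherently serial.
  std::vector<std::string> out;
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::string s;
    if (i > 0) { s = out[i - 1].substr(0, prefix_lengths[i]); }
    s += suffix_bytes.substr(suffix_offsets[i], suffix_lengths[i]);
    out.push_back(std::move(s));
  }
  return out;
}
```

With prefix lengths `{0, 2, 0, 3}`, suffix lengths `{4, 2, 6, 5}`, and suffix bytes `"axislebabbleyhood"`, this reconstructs `axis`, `axle`, `babble`, `babyhood`.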
Checklist